
The reliability pillar of the AWS Well-Architected Framework [pdf] - robfig
https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf
======
inlined
The section about calculating the availability of hard and redundant
dependencies ignores the fact that systems often fail in tandem. For example,
you might have a primary and a secondary database in different AZs, each with
99.9% availability. That gives read operations a hypothetical uptime of six
nines. But hidden SPOFs like operator error, failing VM infrastructure, or
load balancer and other networking outages can make the 0.1% failure windows
overlap, blowing away your six-nines dependency guarantee.
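A minimal sketch of the point in Python (the 99.9% availability assigned to the hidden SPOF below is an assumed illustrative number, not from the paper):

```python
# Two redundant databases, each 99.9% available, assuming independent failures.
def redundant_availability(a_component: float, n: int) -> float:
    """Availability of n redundant copies, if failures were truly independent."""
    return 1 - (1 - a_component) ** n

naive = redundant_availability(0.999, 2)  # ~0.999999, the "six nines" claim

# A shared dependency (operator, load balancer, network) caps the whole read path.
a_spof = 0.999                 # assumed availability of the hidden SPOF
with_spof = a_spof * naive     # ~0.998999: back below three nines

print(f"naive: {naive:.6f}, with shared SPOF: {with_spof:.6f}")
```

One correlated dependency at the same 99.9% erases essentially the entire redundancy benefit.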

~~~
koolba
Many run into this when they learn the hard way that “RAID is not a backup” as
wiping out all your data on one set of disks also wiped it on the other. Ditto
for naive replication with no rollback capability. It’s a fancier version of
the same thing.

“DROP TABLE customer”

“ _Oh crap! Better failover to DR. Oh crap! It’s dropped there too?!_ ”

~~~
earless1
exponentially delayed replication down the replica chain...

~~~
javajosh
Why increase the interval at all? Why not just equally, or at most linearly
increasing, delayed replication?

~~~
nine_k
With replication at 1 minute intervals and ten replicas, you can step back at
most 10 minutes.

With replicas delayed 1, 2, 4, 8, etc. minutes instead, ten replicas let you
step back up to 17 hours.

Much more time to detect and correct a catastrophic error.
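nine_k's arithmetic, sketched in Python under the assumption that each replica lags the previous one in the chain by its own per-hop delay:

```python
# Max step-back window = how far the last replica trails the primary,
# i.e. the sum of the per-hop replication delays along the chain.
def max_stepback_minutes(hop_delays):
    return sum(hop_delays)

uniform = [1] * 10                       # ten hops at 1 minute each
doubling = [2 ** k for k in range(10)]   # 1, 2, 4, ..., 512 minutes

print(max_stepback_minutes(uniform))     # 10 minutes
print(max_stepback_minutes(doubling))    # 1023 minutes, about 17 hours
```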

------
Jedi72
This is total submarine marketing by AWS. I'm sure some very smart people with
good intentions at AWS produced this, but the fact is they're not independent
academics providing objectively "the best" practices; they're a company trying
to sell you something. They probably sincerely believe the AWS way of doing
things is best, but frankly, I know as much as you do. Or at least, if you
want to win me over, you need to comprehensively prove it, not just preach
from the hill.

TL;DR: stop telling me you're doing better architectures in the name of
selling me something

~~~
sciurus
I don't look at the well-architected framework as "the AWS way is the best way
of doing things" but rather "this is the best way of doing things on AWS".

------
random42
> "... For example, if a system makes use of two independent components, each
> with an availability of 99.9%, the resulting system availability is >99.999%
> ... "

This does not seem correct.

~~~
maltalex
What they probably mean is that if each component can fail _independently_
with a probability of 0.001 (0.1%) then the probability of both of them
failing is 0.001 * 0.001 = 0.000001 (0.0001%)

If the system depends on just one of the components working then 1 - 0.000001
= 0.999999 (99.9999%)
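Spelled out as a quick check:

```python
# Availability of a system that needs only one of two independently failing parts.
p_fail = 1 - 0.999         # 0.001 failure probability per component
p_both_fail = p_fail ** 2  # ~1e-6, valid only under independence
availability = 1 - p_both_fail

print(f"{availability:.6f}")  # 0.999999
```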

~~~
pkolaczk
This is correct; however, in reality completely independent components are very
rare. Even things that seem independent and truly redundant, e.g. the jet
engines of an airliner, are much more likely to fail after one of them fails.
Therefore this line of reasoning must be applied with extreme care.

~~~
noir_lord
Indeed, contaminated fuel would do it.

~~~
ElKrist
I would not model the risk from fuel as part of the risk of the jet engines.

~~~
cperciva
Other correlated risks include weather, birds, thrust, age, time since last
maintenance...

In fact, I'm having trouble coming up with any common cause of engine failure
which _isn't_ correlated.

~~~
noir_lord
Manufacturing defect, either material or user error, maybe?

~~~
MaxBarraclough
Many manufacturing defects affect whole batches of units.

People serious about preserving data with redundant arrays tend to be careful
to avoid using multiple drives from the same batch.

I vaguely recall a cloud backup provider losing customer data because they
failed to do this. Annoyingly I can't find it on google.

------
iblaine
Interesting that security is 1 of 5 pillars and hardly gets a mention in this
paper. Who wrote this and why?

~~~
maxmcd
https://d1.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf

~~~
iblaine
Thank you

------
crankylinuxuser
(erased)

~~~
klodolph
Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay
opex to run in multiple regions and lay off your SREs, but if you really
wanted high availability (and not just some service credits when things go
wrong) you would spend the engineering effort to go multi-provider. At that
point on-prem looks better, but at least with multi-provider you’re no longer
in the business of ordering hardware and power.

The idea that failures between different systems are uncorrelated doesn’t even
work at a hardware level (e.g. two different sticks of RAM); it’s pure fantasy
when you’re talking about software stacks, especially when you have things
like the blow-up-the-world button called “BGP.” I’m deeply uncomfortable
reading formulas for availability that add or subtract nines; if there’s real
money on the line, you can afford to do some better math.
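For a taste of that "better math," a toy sketch: the simplest correlated-failure model adds one correlation term to the independent calculation. The `rho` value below is an assumed illustrative number; a real system needs an empirically measured joint-failure estimate, not this toy.

```python
# Two components, each failing with probability p_fail; their failure
# indicators have Pearson correlation rho. Then
#   P(both fail) = p_fail**2 + rho * p_fail * (1 - p_fail)
# (from Cov = rho * p * (1 - p) for identically distributed Bernoulli indicators).
def availability_with_correlation(p_fail: float, rho: float) -> float:
    p_both = p_fail ** 2 + rho * p_fail * (1 - p_fail)
    return 1 - p_both

print(availability_with_correlation(0.001, 0.0))  # ~0.999999: six nines if independent
print(availability_with_correlation(0.001, 0.1))  # ~0.9999: four nines at just 10% correlation
```

Even a modest 10% failure correlation knocks two nines off the naive estimate.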

Disclosure: Work at a cloud provider, opinions are my own.

~~~
cheeze
Something that often goes ignored is that in high availability situations like
that, humans are the biggest risk by a huge margin. Computers (almost always)
don't accidentally wipe out the prod database thinking it was beta, that's
usually a human that made a simple mistake.

Dumping money into IaaS while neglecting to recognize that high availability
requires extreme quality in both engineering and testing/QA is a super common
mistake these days. It means slowing down, which most upper management doesn't
want to do.

Things like executing DR plans regularly to test that they work are extremely
costly to the business, and are one of those "IT" things that you hopefully
never have to actually do in production. But without things like that, what's
the point of 3xing the cost of your fleet and adding complexity (another great
vector to cause failure) for "availability"?

I think the answer often comes down to the fact that people are hard to hire.
Computers are easy to provision.

~~~
FigmentEngine
Yes, humans are really important in these situations; that's why "Operational
Excellence" is the first pillar of AWS Well-Architected - you can only get so
far in terms of reliability and security if you don't consider people and
process. https://d1.awsstatic.com/whitepapers/architecture/AWS-Operational-Excellence-Pillar.pdf
(I work on the AWS Well-Architected team)

