

Cloudy Snake Oil - jmduke
https://blog.pinboard.in/2014/04/cloudy_snake_oil/

======
Lambdanaut
That is the biggest nitpick ever. Nobody on Earth except you read Amazon's
claim the way you did. Usually I'm against referring directly to the person
making the argument rather than the argument itself, but this is just
ridiculous, and it reeks of purposefully misrepresenting Amazon's claims.

Of course all of these insane random catastrophic events could occur. The
numbers given by Amazon are representative of the service they've been able to
responsibly provide _in the past_, and they're extrapolating from that.
Nobody in their right mind would expect Amazon to maintain those numbers if,
for instance, the Earth spontaneously imploded.

Disclaimer for future readers: this comment was written in the year 2014,
about three years before Amazon built their Moon Base Backup Storage Service™
(MB2S2)

~~~
tbagman
I understand your point of view, but I have a different opinion.

When I see these kinds of claims, my guess is that they are usually based on a
failure analysis given a particular replication degree and estimates of
failure probabilities of various components and failure domains. Underlying
this analysis is usually an assumption about the independence of failures
between the failure domains.
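A minimal sketch of the kind of analysis described above, with made-up numbers: if each of k replicas fails (and is not re-replicated in time) with some assumed annual probability p, and failures are assumed independent, the annual loss probability for an object is just p^k. The per-replica probability here is purely illustrative, not anything Amazon has published.

```python
def annual_loss_probability(p: float, k: int) -> float:
    """P(losing an object in a year): all k replicas fail,
    assuming replica failures are independent."""
    return p ** k

# Assumed (illustrative) per-replica annual failure probability.
p = 0.01

# More replicas drive the independent-failure estimate down fast:
for k in (1, 2, 3):
    print(f"replicas={k}: annual loss probability {annual_loss_probability(p, k):.2e}")
```

The whole point of the comment is that the independence assumption baked into `p ** k` is exactly what correlated failures across domains break.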

The good news is that industry has moved away from the computer as the unit of
independent failure to a much larger failure domain: often a cluster within a
data center, or an entire data center. This means that the analysis takes into
account the infrequent occurrence of a large number of correlated failures
within the failure domain.

The bad news is that there are inevitably correlated failures across the
failure domains, regardless of how carefully you design to avoid them.
Software bugs, coordinated attacks, operator errors, cascading failures caused
by well-intentioned but runaway control loops and automated failover
mechanisms, and so on, can be the culprit.

So, here's the problem. This statistic from Amazon, if taken at face value,
would say that relying on Amazon to keep your data durable and safe is
practically risk-free, to the point that durability issues never happen in
your lifetime (or, alternatively, affect such a vanishingly small fraction of
objects that you might not care).

In practice, however, I suspect you do want to plan for the "unknown unknowns"
that will cause data loss at low probability, but much higher probability than
0.000000001%.

Here's another way to look at it: I'd love it if Amazon posted some data about
the rate at which they've experienced durability failures in the past year or
two, rather than posting what I'm supposing (I might be wrong!) are
calculations based on assumptions of independent failure probabilities.

------
spcoll
Any risk estimate makes implicit assumptions, so they are conditional
probabilities rather than absolute probabilities. It is assumed that Amazon
engineers mean to prefix these data loss numbers with "under normal operation
of our services".

Certainly, exceptional and unpredictable incidents make the absolute risk
higher. But in the absence of historical data about the rate of occurrence of
such events, it is impossible to make a reliable assessment of absolute risk.
Not only is it impossible to assign a realistic probability to events that you
know can happen (e.g. a catastrophic bug), but many of the possible incidents
are events you are not even aware can happen (e.g. are you going to factor in
the risk of a meteor strike destroying your data centers?).

------
rcthompson
If your service has existed for T years and you store N objects and have never
lost any objects, then at most you can claim a yearly object loss rate of
"less than 100/(T*N)%". If you claim anything lower than that, you have no
empirical evidence to back it up.

(The more accurate formula would take into account the length of time each
object has been stored, since not all objects were stored for the full time
period.)
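The bound above is easy to sketch. The 8-year, one-trillion-object figures below are hypothetical inputs for illustration; the second function is the refinement mentioned in the parenthetical, which counts object-years directly instead of assuming every object was stored for the full period.

```python
def max_claimable_loss_rate_pct(t_years: float, n_objects: float) -> float:
    """Largest yearly loss rate (in %) a zero-loss history can support,
    assuming all N objects were stored for the full T years."""
    return 100.0 / (t_years * n_objects)

def max_claimable_loss_rate_pct_exact(object_years: float) -> float:
    """Same bound, using the actual sum of per-object storage durations."""
    return 100.0 / object_years

# Hypothetical example: 8 years of operation, a trillion objects.
print(max_claimable_loss_rate_pct(8, 1e12))
```

Even with those generous inputs, the empirically supportable bound is around 1.25e-11% per year — which is why the comment argues anything lower must come from a model, not from observed history.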

------
cordite
Didn't Netflix at one point have data loss on S3 because an Amazon employee
was in production without knowing it?

------
ecma
This is marketing 101: they are making their product sound impressive by
throwing out improbable numbers, but not even the moron in a hurry would truly
believe they can store their 1 object and then check in on it 100 billion
years later. The important number (the 99.99...%) is there in the text; the
rest is a hyperbolic and arbitrary example, intended only to provide a frame
of reference for numbers that are not easily comprehensible as an MTBF.
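The "100 billion years" framing falls straight out of the durability figure. A back-of-envelope sketch (assuming S3's well-known eleven-nines annual durability claim, which is the number this thread is about):

```python
# 99.999999999% annual durability means a ~1e-11 chance of losing a
# given object per year, so the expected wait before a single stored
# object is lost is on the order of 1e11 (100 billion) years.
durability = 0.99999999999            # eleven nines
annual_loss_prob = 1 - durability     # ~1e-11 per object per year
mean_years_to_loss = 1 / annual_loss_prob
print(f"{mean_years_to_loss:.3e} years")
```

That is the sense in which the arbitrary example is just the durability percentage re-expressed as an MTBF-like figure of reference.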

I would also suggest they may not be using historical data to calculate their
object durability, but rather a worst-case calculation based on current
failure models. That's pretty standard practice in future risk modelling,
especially for new systems. Comparing theoretical future risk with evaluated
past resilience is a slightly different beast.

