
The probability of data loss in large clusters - pimeys
http://martin.kleppmann.com/2017/01/26/data-loss-in-large-clusters.html
======
colin_mccabe
Unfortunately, this is not a very realistic analysis.

Martin mentions HDFS in the first paragraph as an example. But HDFS does not
use partitions or consistent hashing for data assignment. All replicas in
HDFS are placed individually, according to the placement policy currently in
effect.
Each block can have a completely different set of nodes. He mentions Kafka as
"typically deployed in... JBOD configuration." But Kafka is actually typically
deployed over a RAID configuration. JBOD support in Kafka is fairly new and
still not often used. It would have been better to mention only the systems
that this analysis is focused on (Cassandra might be one such).

This analysis leaves out a critical factor, which is that the distributed
system will start re-replicating each partition once nodes are declared to be
lost. Clearly, if this can be completed before any other nodes fail, no data
will be lost. This is the reason why you want more partitions, not fewer as
this essay recommends. With more, smaller partitions, you can re-replicate
more quickly.

To see this, consider a thought experiment where there is one partition per
node in a 5,000 node cluster. In this case, each gigantic partition will be
replicated to three different nodes, and the nodes will be in lockstep with
each other. But if one of those nodes is lost, only 2 other machines in the
cluster can help with re-replication. It will be slow. In contrast, if you
have 10,000 partitions per node, and you lose a node, you will be able to use
the resources of the entire cluster for re-replication (assuming you have
well-distributed partitions). This means 5,000 disks spinning, 5,000 network
interfaces passing data, instead of 2.
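
To put rough numbers on this (a back-of-the-envelope sketch; the per-node
data size and per-node rebuild bandwidth are assumptions I'm making up, not
figures from the article):

    # Rough re-replication time after losing one node.
    # Assumed (hypothetical): 10 TB of data per node, and each surviving
    # node can contribute ~50 MB/s of rebuild bandwidth.
    NODE_DATA_MB = 10 * 1024 * 1024        # 10 TB expressed in MB
    REBUILD_MBPS_PER_HELPER = 50

    def rebuild_hours(helpers):
        return NODE_DATA_MB / (helpers * REBUILD_MBPS_PER_HELPER) / 3600

    print(rebuild_hours(2))       # one partition per node: ~29 hours
    print(rebuild_hours(5000))    # many small partitions: ~42 seconds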

~~~
NickM
_This analysis leaves out a critical factor, which is that the distributed
system will start re-replicating each partition once nodes are declared to be
lost. Clearly, if this can be completed before any other nodes fail, no data
will be lost. This is the reason why you want more partitions, not fewer as
this essay recommends. With more, smaller partitions, you can re-replicate
more quickly._

In practice though, it's not uncommon for multiple machines to fail at once:
maybe you lose power to a rack, or a network switch dies. In this case your
partitioning scheme really does matter if you want to minimize the odds of
data loss. Using a smaller number of larger partitions can decrease the odds
of data loss (though, at the expense of losing a larger amount of data if/when
you _do_ lose all N replicas of a partition).

Using many smaller partitions can indeed speed recovery, and sometimes this is
the right tradeoff to make. But, suppose _any_ amount of data loss means you
have to take your cluster offline and rebuild your entire dataset from
scratch: in this case, you may be better off with a small number of partitions
per node.
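
To make that trade-off concrete, here is a quick sketch (the cluster size,
rack size and partition counts are hypothetical assumptions, not numbers from
the article):

    from math import comb

    NODES, RACK, REPLICAS = 1000, 30, 3   # assume one 30-node rack fails at once

    # Chance that one randomly placed partition has all replicas on that rack.
    p_one = comb(RACK, REPLICAS) / comb(NODES, REPLICAS)

    def p_any_loss(num_partitions):
        # Chance that at least one partition loses all of its replicas,
        # treating partition placements as independent.
        return 1 - (1 - p_one) ** num_partitions

    print(p_any_loss(1_000))     # few large partitions:  ~2% chance of any loss
    print(p_any_loss(100_000))   # many small partitions: ~91% chance, but each
                                 # lost partition holds 100x less data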

~~~
batbomb
That's effectively a partition, not data loss. It is good to be rack-aware in
data placement, though, so the data stays available in the case of rack loss.

------
NickM
For a more in-depth analysis of these kinds of effects, see:

[http://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf](http://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf)

The authors basically try to determine "if all data is replicated to N nodes,
how can we minimize the probability of data loss when a random N nodes go down
at once?" There are some really interesting tradeoffs that arise in the
process of answering that question.
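
The core idea, as I understand it, is to limit how many distinct replica sets
exist in the cluster. A toy sketch (not the paper's actual algorithm, which
also balances scatter width for recovery speed):

    from math import comb

    NODES, R = 1000, 3

    # Random placement: essentially every R-node combination ends up being
    # the replica set of some partition, so almost any R simultaneous
    # failures lose something.
    vulnerable_random = comb(NODES, R)

    # Copyset-style placement: replicas only go to one of a fixed set of
    # disjoint R-node groups, so only those combinations are dangerous.
    vulnerable_copysets = NODES // R

    print(vulnerable_random)      # 166,167,000 dangerous combinations
    print(vulnerable_copysets)    # 333 dangerous combinations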

------
woliveirajr
I remember reading, some time ago [1] [2], about how many read errors were
expected without a complete failure of the HDD. I mean, the odds that you just
read a rotten byte from your disk. And how that would be enough to cause data
loss, since you don't need 3 complete failures to lose data: perhaps 1
complete failure plus 2 partial HDD failures...

And as HDD sizes grew larger, things would become complicated. To the point
that the probability of losing a second disk while still restoring the first
one becomes significant, since HDD sizes are so big that the time it takes to
rebuild a whole 3TB disk is huge, in an environment that is in production,
reading and writing things at the same time you're reconstructing the lost
one.

[1] This is one such article:
[http://searchdatabackup.techtarget.com/video/Preston-As-hard...](http://searchdatabackup.techtarget.com/video/Preston-As-hard-drive-capacity-grows-so-does-risk-of-data-loss)

[2] this is another:
[http://www.zdnet.com/article/why-raid-6-stops-working-in-201...](http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/)

~~~
dom0
Even with datasheet values, rebuilding a RAID5 with 4 TB or even larger disks
has a _significant_ chance of failure (IIRC about 15% for HGST MegaScale 4 TB
drives). Since the BER is pretty much a constant, it will be much worse with
larger drives; RAID5 probably does not make much sense with 8, 10 or 12 TB
drives.
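
Roughly where that kind of number comes from (a sketch; the array width and
the 1-in-10^15 unrecoverable-read-error rate are assumed datasheet-style
figures, not measurements):

    # Chance of hitting an unrecoverable read error while rebuilding RAID5.
    # Assumed: a 4+1 array of 4 TB drives, URE rate of 1e-15 per bit read.
    URE_PER_BIT = 1e-15
    DRIVE_BYTES = 4e12
    SURVIVING_DRIVES = 4                  # all must be read in full to rebuild

    bits_read = SURVIVING_DRIVES * DRIVE_BYTES * 8
    p_rebuild_hits_ure = 1 - (1 - URE_PER_BIT) ** bits_read
    print(p_rebuild_hits_ure)             # ~0.12, i.e. roughly a 1-in-8 chance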

~~~
AstralStorm
This is why RAID6 and similar are used. Or tree like structure of RAID10.
Mostly because the probability of a HDD dying whole is lower than in part, you
get much more of the extra redundancy than expected.

These should be also patrolled and resilvered regularly, which should catch
imminent failures too.

------
adt2bt
I'm surprised Erasure Coding didn't come up here. It greatly increases
durability and storage space required at the cost of higher latency reads.

[https://en.wikipedia.org/wiki/Erasure_code](https://en.wikipedia.org/wiki/Erasure_code)

~~~
siculars
Erasure coding does not greatly increase storage space consumption. Consider
a 2n or 3n replication factor in systems like HDFS, Cassandra and Riak. In
those cases storage space consumption is 200% and 300%, respectively. Erasure
coding allows you to select the percentage increase in space consumed for
some level of durability. You can, for instance, configure the system to
encode a given blob of data at 1.5x (150%) space consumption and then stripe
that data across 10 or so locations, only requiring 6 or 7 of them to
reconstruct the original blob. Facebook put a post out about this a while
back re: their long-term picture storage.
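
To make the overhead comparison concrete (a sketch; the 10-fragment,
7-needed split is just an illustration of the numbers above, not any
particular system's configuration):

    # Storage overhead: plain replication vs. erasure coding.
    REPLICAS = 3                          # 3x replication
    FRAGMENTS, NEEDED = 10, 7             # stripe into 10 pieces, any 7 rebuild

    replication_overhead = REPLICAS       # 3.0x raw storage
    erasure_overhead = FRAGMENTS / NEEDED # ~1.43x raw storage
    tolerated_losses = FRAGMENTS - NEEDED # survives 3 lost fragments

    print(replication_overhead, erasure_overhead, tolerated_losses)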

~~~
adt2bt
You're correct, I mistyped & intended to say what you wrote.

------
MeteorMarc
He basically says that the chance of losing one unit of data increases
linearly with the number of units of data the cluster contains. That does not
surprise me.
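
That's essentially the union bound at work (a one-liner sketch; the
per-partition loss probability is an arbitrary assumption):

    # P(lose at least one of k partitions) = 1 - (1 - p)^k, which is ~k*p
    # while k*p stays small.
    p = 1e-6                              # assumed per-partition loss probability
    for k in (1_000, 10_000, 100_000):
        print(k, 1 - (1 - p) ** k)        # ~0.001, ~0.01, ~0.095: roughly linear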

------
Terribledactyl
The largest error in this analysis is the arbitrary choice of a 0.1% failure
rate for some arbitrary time period. For the analysis to make sense, the rate
should be the probability of failure over the time period it takes to
re-replicate back to 3 copies. What makes this hard to compute is that drives
don't have a linear death rate with respect to age; it's more "bathtub" or `U`
shaped. And this number should really be a function of the ages of all the
disks.

Drives seem to have a sweet spot after a few months and up until 2-4 years
where the failure rate is really low.
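
For a sense of scale (a sketch; the annualized failure rate and rebuild
window are assumptions, and a constant rate ignores the bathtub effect
described above):

    # Approximate chance that a given node fails during a short rebuild
    # window, assuming a constant (non-bathtub) annualized failure rate.
    AFR = 0.02                            # assumed 2% annualized failure rate
    REBUILD_HOURS = 1.0                   # assumed re-replication window

    p_fail_in_window = 1 - (1 - AFR) ** (REBUILD_HOURS / (365 * 24))
    print(p_fail_in_window)               # ~2.3e-6, far below the 0.1% in the article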

------
mutagen
Has Backblaze done any digging through their data for hard disk failure
correlations among manufacturers? I know some manufacturers show higher rates
than others, but within the same model, have they seen spikes of failures at
around the same age, number of reads/writes, etc.?

Are there any other large public hard drive failure datasets that could
contribute to refining the type of analysis in this article?

~~~
geforce
They do statistics a few times a year:

[https://www.backblaze.com/blog/hard-drive-failure-rates-q3-2...](https://www.backblaze.com/blog/hard-drive-failure-rates-q3-2016/)

------
sambe
Even ignoring counter-intuitive probabilities, the law of large numbers is
often surprising:

[http://dl.acm.org/citation.cfm?doid=347059.347561](http://dl.acm.org/citation.cfm?doid=347059.347561)

If you deal with large data, you might have experienced such issues already -
hope you have application-level hashes on top!

------
u320
Over time, the expected data loss is the same regardless of cluster size.
It's just that as cluster size grows, you go from "lose 100% of your data in
0.1% of time periods" to "lose 0.1% of your data in 100% of time periods".
This analysis looks at only a single time period, so that effect is obscured.
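
A tiny sketch of that point (the 0.1% per-partition loss probability is just
an assumed number for illustration):

    # Expected loss per period is the same; only the shape of the risk changes.
    p = 0.001                             # assumed per-partition loss probability
    for k in (1, 100, 10_000):            # partitions in the cluster
        expected_fraction = p             # (k * p) / k -- independent of k
        p_any_loss = 1 - (1 - p) ** k     # chance that the period loses anything
        print(k, expected_fraction, round(p_any_loss, 4))
    # 1 partition: all-or-nothing, loses everything in 0.1% of periods
    # 10,000 partitions: loses ~0.1% of the data in nearly every period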

------
tush726
He's leaving out a critical factor, which is the replication time (relative
to the time window used for the failure probability). The probabilities will
be lower if you take it into account, unless you have very long replication
times.

