Hacker News new | past | comments | ask | show | jobs | submit login
The probability of data loss in large clusters (kleppmann.com)
87 points by pimeys on Jan 26, 2017 | hide | past | web | favorite | 21 comments

Unfortunately, this is not a very realistic analysis.

Martin mentions HDFS in the first paragraph as an example. But HDFS does not use partitions or consistent hashing for data assignment. All replicas in HDFS are placed manually according to the placement policy currently in effect. Each block can have a completely different set of nodes. He mentions Kafka as "typically deployed in... JBOD configuration." But Kafka is actually typically deployed over a RAID configuration. JBOD support in Kafka is fairly new and still not often used. It would have been better to mention only the systems that this analysis is focused on (Cassandra might be one such).

This analysis leaves out a critical factor, which is that the distributed system will start re-replicating each partition once nodes are declared to be lost. Clearly, if this can be completed before any other nodes fail, no data will be lost. This is the reason why you want more partitions-- not less as this essay recommends. With more, smaller partitions, you can re-replicate quicker.

To see this, consider a thought experiment where there is one partition per node in a 5,000 node cluster. In this case, each gigantic partition will be replicated to three different nodes, and the nodes will be in lockstep with each other. But if one of those nodes is lost, only 2 other machines in the cluster can help with re-replication. It will be slow. In contrast, if you have 10,000 partitions per node, and you lose a node, you will be able to use the resources of the entire cluster for re-replication (assuming you have well-distributed partitions). This means 5,000 disks spinning, 5,000 network interfaces passing data, instead of 2.

Also on multi-node failures smaller partitions actually reduce the amount of critical data to re-replicate to keep the probability low and even give you ability to prioritize that.

For example losing two nodes with 100 partitions on each might give you only like 5 partitions that happen to have replicas living on both nodes and that need to be re-replicated as soon as possible from a single still existing replica, but the rest of the replicas for 190 different partitions have 2 existing replicas and can wait til those 5 high-priority ones are re-replicated.

I guess this answers Martin's question - "you can choose between a high probability of losing a small amount of data, and a low probability of losing a large amount of data! Is the latter better?" - no, the latter is worse, much-much worse.

This analysis leaves out a critical factor, which is that the distributed system will start re-replicating each partition once nodes are declared to be lost. Clearly, if this can be completed before any other nodes fail, no data will be lost. This is the reason why you want more partitions-- not less as this essay recommends. With more, smaller partitions, you can re-replicate quicker.

In practice though, it's not uncommon for multiple machines to fail at once: maybe you lose power to a rack, or a network switch dies. In this case your partitioning scheme really does matter if you want to minimize the odds of data loss. Using a smaller number of larger partitions can decrease the odds of data loss (though, at the expense of losing a larger amount of data if/when you do lose all N replicas of a partition).

Using many smaller partitions can indeed speed recovery, and sometimes this is the right tradeoff to make. But, suppose any amount of data loss means you have to take your cluster offline and rebuild your entire dataset from scratch: in this case, you may be better off with a small number of partitions per node.

That's effectively a partition, not data loss. It is good to be rack-aware though for data placement to still be available in the case of rack loss.

Let's Supose each chunk of data is one 3 machines.

If each each machine has 3 chunks you now risk data loss of those 2 machines fail before they offload data. However if each machine holds 3X munger of machine chunks then if any 2 machines fail you lose data instead of 2 specific machines failing. So even though you replicate faster, your risk of some data loss is actually higher.

Basically you risk data loss over 1/5,000 the time but have 5000 * 4999 more ways to fail. The only real upside is the actual data loss is likely to be very small.

There's an obvious tradeoff between number of partitions and overhead though. Too many, and the increased network traffic will impact the usefulness of the cluster, especially if they need to be fetched from multiple racks for a single computation.

You sound well informed, I just wanted to point out that HFDS replication is not a simple problem.

Also, hopefully you have more than one disk per node (assuming physical machines) ;)

For a more in-depth analysis of these kinds of effects, see:


The authors basically try to determine "if all data is replicated to N nodes, how can we minimize the probability of data loss when a random N nodes go down at once?" There are some really interesting tradeoffs that arise in the process of answering that question.

I remember reading, some time ago [1] [2] about how many errors were expected without a complete failure of the HDD. I mean, the odds that you just read a rotten byte out from your disk. And how that would be enough to cause data loss, since you don't need 3 complete failures to loose data: perhaps 1 complete failure plus 2 partial HDD failures...

And as HDD sizes grew larger, things would become complicate. To the point that the probability of loosing the second disk while still restoring the first one, since HDD sizes were so big that the time it takes to rebuild a whole 3TB disk is huge, in a environment that is in production, reading and writing things at the same time you're reconstructing the lost one.

[1] This is one of such articles: http://searchdatabackup.techtarget.com/video/Preston-As-hard...

[2] this is another: http://www.zdnet.com/article/why-raid-6-stops-working-in-201...

Even with datasheet values -- rebuilding a RAID5 with 4 TB or even larger disks has a significant chance of failure (iirc about 15 % for HGST MegaScale 4 TB drives; since the BER is pretty much a constant, it will be much worse with larger drives. RAID5 probably does not make much sense with 8, 10 or 12 TB drives.).

This is why RAID6 and similar are used. Or tree like structure of RAID10. Mostly because the probability of a HDD dying whole is lower than in part, you get much more of the extra redundancy than expected.

These should be also patrolled and resilvered regularly, which should catch imminent failures too.

I'm surprised Erasure Coding didn't come up here. It greatly increases durability and storage space required at the cost of higher latency reads.


Erasure encoding does not greatly increase storage space consumption. Consider a 2n or 3n replication factor in systems like hdfs, Cassandra and Riak. In those cases storage space consumption is 200% and 300%, respectively. Erasure encoding allows you to select the percentage increase in space consumed for some level of durability. You can, for instance, confirgure the system to encode a given blob of data at 1.5x (150%) space consumption and then stripe that data across 10 or so locations, only requiring 6 or 7 of them to reconstruct the original blob. Facebook put a post out about this some while back re. their long term picture storage.

You're correct, I mistyped & intended to say what you wrote.

Erase encoding was my first thought too. Here's a link to the paper, for anyone who's interested:


He basically says that the chance of loosing one unit of data linearly increases with the number of units of data the cluster contains. That does not surprise me.

The largest error in this analysis is the arbitrary picking of a .1% failure rate for some arbitrary time period. In order for this analysis to make sense, it should be right at the probability of failure for the time period it takes to replicate back to 3 copies. What makes this hard to compute is that drives don't have a linear death rate in regards to age, it's more "bathtub" or `U` shaped. And this number should really be a function of the ages of all the disks.

Drives seem to have a sweet spot of reliability after a few months and up until a 2-4 years that is really low.

Has Backblaze done any digging through their data for hard disk failure correlations among manufacturers? I know some manufacturers show higher rates than others but within same model have they seen spikes of failures at around the same age, # of reads/writes, etc?

Are there any other large public hard drive failure datasets that could contribute to refining the type of analysis in this article?

Even ignoring counter-intuitive probabilities, the law of large numbers is often surprising:


If you deal with large data, you might have experienced such issues already - hope you have application level hashes on top!

Over time, data loss is the same regardless of cluster size. It's just that as cluster size grows, you go from "lose 100% of your data in 0.1% of time periods" to "lose 0.1% of your data in 100% of time periods". This analysis looks at only a single time period so that effect is obscured.

He's leaving out a critical factor which is the replication time (in relation to the time used for the failure chance). The probabilities will be lower if you take it into account unless you have very long replication times.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact