Martin mentions HDFS in the first paragraph as an example. But HDFS does not use partitions or consistent hashing for data assignment. All replicas in HDFS are placed manually according to the placement policy currently in effect. Each block can have a completely different set of nodes. He mentions Kafka as "typically deployed in... JBOD configuration." But Kafka is actually typically deployed over a RAID configuration. JBOD support in Kafka is fairly new and still not often used. It would have been better to mention only the systems that this analysis is focused on (Cassandra might be one such).
This analysis leaves out a critical factor, which is that the distributed system will start re-replicating each partition once nodes are declared to be lost. Clearly, if this can be completed before any other nodes fail, no data will be lost. This is the reason why you want more partitions, not fewer as this essay recommends. With more, smaller partitions, you can re-replicate more quickly.
To see this, consider a thought experiment where there is one partition per node in a 5,000 node cluster. In this case, each gigantic partition will be replicated to three different nodes, and the nodes will be in lockstep with each other. But if one of those nodes is lost, only 2 other machines in the cluster can help with re-replication. It will be slow. In contrast, if you have 10,000 partitions per node, and you lose a node, you will be able to use the resources of the entire cluster for re-replication (assuming you have well-distributed partitions). This means 5,000 disks spinning, 5,000 network interfaces passing data, instead of 2.
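To put very rough numbers on that (all figures hypothetical - say 4 TB per node and 100 MB/s of rebuild bandwidth per helper node), something like:

    # Back-of-envelope: time to re-replicate a failed node's data.
    # 4 TB per node and 100 MB/s per helper are made-up numbers.
    DATA_PER_NODE_TB = 4
    PER_HELPER_MB_S = 100

    def rebuild_seconds(helper_nodes):
        total_mb = DATA_PER_NODE_TB * 1024 * 1024
        return total_mb / (PER_HELPER_MB_S * helper_nodes)

    print(rebuild_seconds(2) / 3600)  # ~5.8 hours with only 2 helpers
    print(rebuild_seconds(4999))      # ~8.4 seconds with 4,999 helpers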
For example, losing two nodes with 100 partitions each might leave only about 5 partitions that happen to have replicas on both failed nodes; those need to be re-replicated as soon as possible from their single remaining replica, while the other 190 partitions still have 2 replicas each and can wait until the 5 high-priority ones are re-replicated.
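A minimal sketch of that triage, assuming a hypothetical partition-to-nodes placement map with replication factor 3:

    # Split partitions into "urgent" (lost 2 of 3 replicas) and
    # "can wait" (lost only 1) after two nodes fail. The placement
    # map and node names below are hypothetical.
    def triage(placement, failed):
        urgent, can_wait = [], []
        for partition, nodes in placement.items():
            lost = len(nodes & failed)
            if lost == 2:
                urgent.append(partition)
            elif lost == 1:
                can_wait.append(partition)
        return urgent, can_wait

    placement = {
        "p1": {"n1", "n2", "n3"},
        "p2": {"n1", "n4", "n5"},
        "p3": {"n2", "n6", "n7"},
    }
    urgent, can_wait = triage(placement, failed={"n1", "n2"})
    print(urgent)    # ['p1'] - down to one replica, re-replicate first
    print(can_wait)  # ['p2', 'p3'] - still have 2 replicas each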
I guess this answers Martin's question - "you can choose between a high probability of losing a small amount of data, and a low probability of losing a large amount of data! Is the latter better?" - no, the latter is worse, much, much worse.
In practice though, it's not uncommon for multiple machines to fail at once: maybe you lose power to a rack, or a network switch dies. In this case your partitioning scheme really does matter if you want to minimize the odds of data loss. Using a smaller number of larger partitions can decrease the odds of data loss (though, at the expense of losing a larger amount of data if/when you do lose all N replicas of a partition).
Using many smaller partitions can indeed speed recovery, and sometimes this is the right tradeoff to make. But, suppose any amount of data loss means you have to take your cluster offline and rebuild your entire dataset from scratch: in this case, you may be better off with a small number of partitions per node.
If each machine has 3 chunks, you risk data loss only if those 2 specific machines fail before they offload their data. However, if each machine instead holds many times that number of smaller chunks spread across the cluster, then any 2 machines failing loses data, instead of only 2 specific machines failing. So even though you re-replicate faster, your risk of some data loss is actually higher.
Basically your exposure window is roughly 1/5,000 as long, but there are something like 5,000 * 4,999 more ways to fail. The only real upside is that the actual data loss is likely to be very small.
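One way to see the "more ways to fail" point is to count the node pairs whose simultaneous failure would lose some chunk. A toy sketch, simplified to 2 replicas per chunk and a 10-node cluster (both hypothetical):

    import random

    # Count node pairs whose simultaneous failure loses some chunk,
    # assuming 2 replicas per chunk for simplicity.
    def dangerous_pairs(placement):
        return {tuple(sorted(nodes)) for nodes in placement.values()}

    random.seed(0)
    nodes = [f"n{i}" for i in range(10)]

    # few big chunks confined to one pair of nodes
    narrow = {f"chunk{i}": {"n1", "n2"} for i in range(3)}
    # many small chunks scattered over the whole cluster
    wide = {f"chunk{i}": set(random.sample(nodes, 2)) for i in range(300)}

    print(len(dangerous_pairs(narrow)))  # 1 pair can hurt you
    print(len(dangerous_pairs(wide)))    # ~45 pairs (nearly every pair)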
You sound well informed, I just wanted to point out that HDFS replication is not a simple problem.
Also, hopefully you have more than one disk per node (assuming physical machines) ;)
The authors basically try to determine "if all data is replicated to N nodes, how can we minimize the probability of data loss when a random N nodes go down at once?" There are some really interesting tradeoffs that arise in the process of answering that question.
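A rough Monte Carlo version of that question (just a sketch with a hypothetical 60-node cluster and two made-up placement strategies, not the paper's actual method): estimate the chance that some partition loses all 3 replicas when 3 random nodes fail, for fully random placement versus placement confined to a few fixed replica groups:

    import random

    NODES = 60          # hypothetical cluster size
    PARTITIONS = 2000   # hypothetical partition count
    RF = 3              # replicas per partition

    def random_placement():
        return [frozenset(random.sample(range(NODES), RF))
                for _ in range(PARTITIONS)]

    def grouped_placement():
        # confine replicas to NODES/RF fixed, disjoint groups
        groups = [frozenset(range(i, i + RF)) for i in range(0, NODES, RF)]
        return [random.choice(groups) for _ in range(PARTITIONS)]

    def p_loss(placement, trials=10000):
        replica_sets = set(placement)
        lost = 0
        for _ in range(trials):
            failed = set(random.sample(range(NODES), RF))
            if any(rs <= failed for rs in replica_sets):
                lost += 1
        return lost / trials

    random.seed(1)
    print(p_loss(random_placement()))   # higher: ~2,000 distinct replica sets can be hit
    print(p_loss(grouped_placement()))  # much lower: only 20 sets can be hit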
And as HDD sizes grew larger, things became more complicated, to the point that the probability of losing the second disk while still restoring the first one became significant: HDD sizes got so big that rebuilding a whole 3TB disk takes a huge amount of time, especially in a production environment that keeps reading and writing while you're reconstructing the lost one.
This is one such article: http://searchdatabackup.techtarget.com/video/Preston-As-hard...
This is another: http://www.zdnet.com/article/why-raid-6-stops-working-in-201...
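Back-of-envelope on the rebuild-time point, with made-up throughput numbers:

    # Rough rebuild-time estimate for a 3 TB drive while the array
    # keeps serving production traffic. Throughput figure is a guess.
    DISK_TB = 3
    EFFECTIVE_MB_S = 40  # rebuild throughput under production load

    rebuild_hours = DISK_TB * 1024 * 1024 / EFFECTIVE_MB_S / 3600
    print(rebuild_hours)  # ~21.8 hours of exposure to a second failure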
These should also be patrolled and resilvered regularly, which should catch imminent failures too.
Drives seem to have a reliability sweet spot from a few months old up until about 2-4 years, where the failure rate is really low.
Are there any other large public hard drive failure datasets that could contribute to refining the type of analysis in this article?
If you deal with large data, you might have experienced such issues already - hope you have application level hashes on top!
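For what "application level hashes" could look like, here's a minimal sketch (helper names are hypothetical) that stores a SHA-256 digest next to each blob and verifies it on read:

    import hashlib

    # Keep a SHA-256 digest next to each blob so silent corruption
    # anywhere below the application gets detected on read.
    def write_blob(path, data):
        with open(path, "wb") as f:
            f.write(data)
        with open(path + ".sha256", "w") as f:
            f.write(hashlib.sha256(data).hexdigest())

    def read_blob(path):
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError("checksum mismatch for " + path)
        return data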