

USENIX ATC best student paper award on distributed storage - sqrtnlogn
http://www.stanford.edu/~cidon/materials/Usenix%20Final.pdf

======
tomp
TL;DR:

Datacenter operators incur a significant cost if some nodes fail permanently
after a cluster-wide power outage: finding the chunks of data that were lost
(i.e. chunks whose replicas all failed) carries a large fixed cost per
incident. It is therefore in their interest to reduce the probability of
data loss at the expense of increasing its magnitude (i.e. you lose data
less often, but when you do, you lose more of it).

> The probability of data loss is minimized when each node is a member of
> exactly one copyset. For example, assume our system has 9 nodes with
> R[eplication] = 3 that are split into three copysets: {1, 2, 3}, {4, 5, 6},
> [and] {7, 8, 9}. Our system would only lose data if nodes 1, 2 and 3, nodes
> 4, 5 and 6 or nodes 7, 8 and 9 fail simultaneously.

> In contrast, with random replication and a sufficient number of chunks, any
> combination of 3 nodes would be a copyset, and any combination of 3 nodes
> that fail simultaneously would cause data loss.
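
A quick sanity check of the quoted numbers (a minimal Python sketch; the
node labels and copysets are taken straight from the example above):

```python
from itertools import combinations
from math import comb

N, R = 9, 3   # 9 nodes, replication factor 3
FAILED = 3    # simultaneous node failures

# Copyset replication: each node is a member of exactly one copyset.
copysets = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]

# Data is lost only if all R members of some copyset fail together.
lossy = sum(1 for f in combinations(range(1, N + 1), FAILED)
            if set(f) in copysets)
print(f"copysets: {lossy} of {comb(N, FAILED)} failure combos lose data")
# copysets: 3 of 84 failure combos lose data (~3.6%)

# Random replication with enough chunks: every 3-node subset ends up
# holding all replicas of some chunk, so every combination loses data.
print(f"random: {comb(N, FAILED)} of {comb(N, FAILED)} failure combos lose data")
```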

In the example above, after a single node failure there are only 2 other
nodes from which a replacement node can bootstrap. The authors therefore
relax the constraint that each node belongs to exactly one copyset, which
slightly increases the probability of data loss but speeds up recovery from
partial failures.
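
My understanding of how the paper builds these relaxed copysets: it chops up
random permutations of the node list, so each node lands in several copysets
and recovery can read from more peers. A rough sketch of that idea (the
function and parameter names here are mine, not the paper's API):

```python
import random
from math import ceil

def make_copysets(nodes, r, scatter_width):
    # Each random permutation is chopped into consecutive groups of
    # size r; enough permutations are drawn that every node shares a
    # copyset with roughly `scatter_width` distinct other nodes.
    num_perms = ceil(scatter_width / (r - 1))
    copysets = set()
    for _ in range(num_perms):
        p = random.sample(nodes, len(nodes))
        for i in range(0, len(p) - r + 1, r):
            copysets.add(frozenset(p[i:i + r]))
    return copysets

nodes = list(range(1, 10))
# scatter_width = r - 1 = 2 reproduces the strict one-copyset-per-node
# layout above; raising it adds copysets, trading a somewhat higher
# loss probability for faster recovery after a single node failure.
for cs in sorted(map(sorted, make_copysets(nodes, r=3, scatter_width=4))):
    print(cs)
```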

------
eigenrick
It seems that this is just a structured way to formally keep more copies of
your data, when what you're really trying to avoid is a rack-level event
removing availability of all your replicas at once.

Ceph, described in Weil's thesis (http://ceph.com/papers/weil-thesis.pdf),
does just that by letting you encode the structure of your datacenter into
the pseudorandom, deterministic placement algorithm it uses to place reads
and writes.
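
For comparison, a toy version of that kind of placement (this is not Ceph's
actual CRUSH algorithm, just a hash-based illustration; the rack and node
names are made up):

```python
import hashlib

# A chunk's replicas are mapped deterministically onto r distinct
# racks, then onto one node per rack, so no single rack-level event
# can remove all replicas at once.
RACKS = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1", "c2", "c3"],
}

def rank(chunk_id, name):
    # Stable pseudorandom rank derived only from (chunk, target).
    digest = hashlib.sha1(f"{chunk_id}/{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(chunk_id, r=3):
    racks = sorted(RACKS, key=lambda rk: rank(chunk_id, rk))[:r]
    return [min(RACKS[rk], key=lambda n: rank(chunk_id, n)) for rk in racks]

# Deterministic: any client computes the same placement with no
# central metadata lookup.
print(place("chunk-42"))
```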

