I love a good storage story. It's interesting that they still put them behind a couple of gigabit network ports. Over the native network (2 x 1GbE), even a full dump at line rate would take more than a week to get a full load onto or off of a pod.
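For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. The function name and the pod capacities are my own illustrative assumptions (the comment above doesn't state a capacity), and it assumes both links run at full line rate with no protocol overhead, which is optimistic.

```python
def transfer_days(capacity_tb: float, links: int = 2, gbps_per_link: float = 1.0) -> float:
    """Days needed to move capacity_tb (decimal terabytes) over the given links.

    Assumes perfect link utilization and zero protocol overhead,
    so real transfers would take longer.
    """
    bits = capacity_tb * 1e12 * 8             # total bits to move
    rate_bps = links * gbps_per_link * 1e9    # aggregate line rate in bits/s
    return bits / rate_bps / 86_400           # seconds -> days


if __name__ == "__main__":
    # Capacities below are hypothetical examples, not figures from the thread.
    for capacity_tb in (67, 180, 1000):
        print(f"{capacity_tb:>5} TB over 2 x 1GbE: {transfer_days(capacity_tb):.1f} days")
```

Under these assumptions anything above roughly 150 TB per pod lands in "more than a week" territory, and a full petabyte is well over a month.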
I had an ultimately unrewarding conversation with Sean Quinlan (of Google GFS fame) about the futility of putting a lot of storage behind such a small channel (in Google's case the numbers were epically Google, of course, but the argument was the same). You waste all of the spindles because the operation rate (requests coming into the channel) is so low relative to the data ops needed to satisfy each request that your disks basically sit around waiting for the next request to come in from the network. (BTW, that allows you to make a nearly perfect emission rate scheduler for disk arms, but that is another story.)
What this means is that petabyte pods are going to be nearly useless, although with an external index they can be dense.
I could see it being a problem for Google, but Backblaze wants these for archival purposes, not something where there is going to be a lot of reading and writing. The write rate is going to be whatever speed their users upload stuff, divided by the total number of their storage pods. I assume this is relatively small. The read rate is going to be whatever speed their users download restores, divided by the total number of storage pods, which is probably much smaller still.
The assumption here is that data is kept for a long time relative to how frequently it's written and read, so the IO speed probably isn't that big of a deal.
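To put numbers on the "divided by the total number of storage pods" argument, here is a minimal sketch. The aggregate rate and pod count are made up for illustration (Backblaze's real figures aren't given here), and it assumes load is spread evenly across pods.

```python
def per_pod_gbps(aggregate_gbps: float, pods: int) -> float:
    """Average per-pod traffic, assuming load is spread evenly across pods."""
    return aggregate_gbps / pods


def link_utilization(aggregate_gbps: float, pods: int, link_gbps: float = 2.0) -> float:
    """Fraction of a pod's network capacity (default 2 x 1GbE) used by that traffic."""
    return per_pod_gbps(aggregate_gbps, pods) / link_gbps


if __name__ == "__main__":
    # Hypothetical: 10 Gbit/s of total user uploads landing on 100 pods.
    print(f"per pod: {per_pod_gbps(10, 100):.2f} Gbit/s")
    print(f"link utilization: {link_utilization(10, 100):.0%}")
```

With numbers in that ballpark each pod's ports sit mostly idle in steady state, which is the point being made: the slow links only hurt when a pod has to move a large share of its contents at once.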
No. As you said, port speed doesn't matter for data at rest. What matters is ingest/exfil of data under "exceptional" conditions. The prime cases are cluster/mirror failures: remirroring existing data to another pod is port limited, as is ingest for pods that are remirror targets.
Is there any reason the sources and targets couldn't both be thoroughly distributed throughout the cluster? Nothing says hard drives have to be perfectly replicated; you just need multiple copies of the data. I'm imagining that an HD dies, and the extra copies of what it contained are scattered all over. You re-replicate them by scattering them further all over. No one pod has to move any substantial amount of data.
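A quick sketch of why scattering helps, under a deliberately simple model: the lost data is split evenly across the participating pods, each pod can dedicate its full 2 x 1GbE to the rebuild, and nothing upstream is a bottleneck. The drive size and pod counts are hypothetical.

```python
def rebuild_hours(data_tb: float, pods_sharing_load: int = 1, gbps_per_pod: float = 2.0) -> float:
    """Hours to re-replicate data_tb when the work is split evenly across pods.

    Simplified model: every participating pod moves data_tb / pods_sharing_load
    at its full port speed, and the switches above them never saturate.
    """
    if pods_sharing_load < 1:
        raise ValueError("need at least one pod to carry the rebuild")
    bits_per_pod = data_tb * 1e12 * 8 / pods_sharing_load
    return bits_per_pod / (gbps_per_pod * 1e9) / 3600


if __name__ == "__main__":
    # Hypothetical 4 TB drive failure.
    print(f"one target pod:   {rebuild_hours(4):.2f} h")
    print(f"spread over 50:   {rebuild_hours(4, pods_sharing_load=50):.2f} h")
```

In this model the rebuild time drops linearly with the number of pods sharing the work, so no single pod's ports are the limit.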
Sure. You can absolutely replicate chunks. But you start kicking the problem upstream. A rack down is a couple of PB, so you start doing a ton of cross-rack transfers to get your replica counts back up. Now you're gated on NIC/TOR/agg switch throughput. A DC down and you're gated on NICs, TORs, aggs, and the intra-DC network. And this keeps adding up $$$ the further you get.
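Roughly, the recovery rate is set by the narrowest tier on the path. A minimal sketch of that, where the tier names and all bandwidth figures are invented for illustration and contention with normal traffic is ignored:

```python
def recovery_days(data_pb: float, tier_gbps: dict) -> tuple:
    """Return (bottleneck tier name, days to move data_pb through it).

    tier_gbps maps a tier name to the aggregate bandwidth in Gbit/s that
    tier can give to the recovery. The slowest tier gates the whole transfer.
    """
    bottleneck = min(tier_gbps, key=tier_gbps.get)
    seconds = data_pb * 1e15 * 8 / (tier_gbps[bottleneck] * 1e9)
    return bottleneck, seconds / 86_400


if __name__ == "__main__":
    # Hypothetical: 2 PB rack lost; bandwidth available to recovery at each tier.
    tiers = {"nics": 400.0, "tor_uplinks": 160.0, "agg_switch": 80.0}
    name, days = recovery_days(2, tiers)
    print(f"gated on {name}: {days:.1f} days")
```

Adding more pods only raises the NIC figure; once the TOR uplinks or the agg layer is the minimum, the only fix is buying more upstream bandwidth.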
MS had an interesting paper on data locality in storage last year. Can't recall the title offhand, though.